CLAIMS 

What is claimed is: 



1. A method of categorizing a plurality of new electronic documents into a set of
categories, each of the categories containing a plurality of training set documents, by
using a matrix representing document similarity that is derived by combining two or
more measures of document similarity.

2. A method as recited in Claim 1, wherein the measures of document similarity include
hyperlink similarity. 



3. A method as recited in Claim 2, in which two documents among the plurality of 
documents are considered similar to each other when there is a link from one to the 
other, or when the two documents link to, or are linked to by, a set of other associated 
documents. 

4. A method as recited in Claim 3, in which certain hyperlinks have greater or lesser 
similarity weight than other hyperlinks, based on other features of the links or their 
source or destination documents. 

5. A method as recited in Claim 1, wherein the measures of document similarity include 
a similarity of text of the documents. 

6. A method as recited in Claim 5, wherein two documents are considered similar based
on a comparison of word vectors derived from the text of each of the two documents.

7. A method as recited in Claim 5, wherein text similarity is determined in part based
upon weight values assigned to words of the text, and wherein certain words have 
greater or lesser weight than other words. 
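
As an illustration of the text similarity of Claims 6 and 7, the following is a minimal sketch, not the claimed implementation. It assumes whitespace tokenization, term-frequency word vectors, and cosine comparison; the `weights` table and all function names are hypothetical.

```python
import math
from collections import Counter

def word_vector(text, weights=None):
    # Term-frequency vector; `weights` is a hypothetical per-word weight table
    # (words not listed default to a weight of 1.0).
    weights = weights or {}
    counts = Counter(text.lower().split())
    return {w: c * weights.get(w, 1.0) for w, c in counts.items()}

def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def text_similarity(a, b, weights=None):
    return cosine(word_vector(a, weights), word_vector(b, weights))
```

Assigning a low weight to a common word such as "the" reduces its contribution to the score, which is one way certain words could carry lesser weight than others.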

8. A method as recited in Claim 1, wherein the measures of document similarity include 
user click-through similarity. 

9. A method as recited in Claim 8, wherein two documents are considered similar based 
on user click-through similarity when the documents are associated with similar 
patterns of user click behavior, selected from among frequency of clicks, click 
context, duration of viewing, proximity in time to other clicks, or proximity in context 
to other clicks. 

10. A method as recited in Claim 1, wherein the measures of document similarity are 
derived from patterns detected in user viewing of the documents. 

11. A method as recited in Claim 10, wherein the user viewing information is monitored 
by a web caching system and stored in a log. 

12. A method as recited in Claim 10, wherein two documents are considered similar based 
on patterns of user viewing behavior, including frequency of viewing, viewing 
context, duration of viewing, proximity in time to other documents viewed by the 
same user, or similarity of patterns of viewing by all users.

13. A method as recited in Claim 1, wherein the measures of document similarity include
URL similarity. 

14. A method as recited in Claim 13, wherein two documents are considered similar if a 
URL of each document contains similar URL sub-components. 
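
One possible reading of the URL similarity of Claim 14 is sketched below. It assumes that the "sub-components" are host labels and path segments and that overlap is measured with a Jaccard ratio; both choices, and the function names, are assumptions made for illustration.

```python
from urllib.parse import urlparse

def url_components(url):
    # Split a URL into host labels and path segments (assumed sub-components).
    parsed = urlparse(url)
    host_parts = [p for p in parsed.netloc.lower().split(".") if p]
    path_parts = [p for p in parsed.path.lower().split("/") if p]
    return set(host_parts) | set(path_parts)

def url_similarity(a, b):
    # Jaccard overlap of the two component sets, in the range [0, 1].
    ca, cb = url_components(a), url_components(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0
```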

15. A method as recited in Claim 1, wherein the measures of document similarity include 
multimedia similarity. 

16. A method as recited in Claim 15, wherein two documents are considered similar based 
on features derived from multimedia components linked to or contained by the 
documents. 

17. A method as recited in Claim 1, wherein the combination of two or more measures of
document similarity is achieved by taking the union of each of a plurality of graphs, 
each graph describing one of the measures of document similarity, to compute a 
combined graph that describes the combined document similarity. 

18. A method as recited in Claim 1, wherein the combination of two or more measures of 
document similarity is achieved by taking the intersection of each of a plurality of 
graphs, each graph describing one of the measures of document similarity, to compute 
a combined graph that describes the combined document similarity. 
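
The graph combinations of Claims 17 and 18 can be sketched as follows. The sketch assumes each similarity graph is a dictionary from an undirected edge (a pair of document identifiers) to a weight. Taking the maximum weight for a union and the minimum for an intersection is an assumption; the claims do not specify how weights are merged.

```python
def edge(a, b):
    # Undirected edge between two document identifiers.
    return frozenset((a, b))

def combine_graphs(graphs, mode="union"):
    # graphs: list of dicts mapping edge -> weight, one dict per similarity measure.
    if not graphs:
        return {}
    edge_sets = [set(g) for g in graphs]
    if mode == "union":
        # Keep an edge if any measure reports it (assumed weight: the maximum).
        edges = set().union(*edge_sets)
        return {e: max(g.get(e, 0.0) for g in graphs) for e in edges}
    if mode == "intersection":
        # Keep an edge only if every measure reports it (assumed weight: the minimum).
        edges = set.intersection(*edge_sets)
        return {e: min(g[e] for g in graphs) for e in edges}
    raise ValueError("mode must be 'union' or 'intersection'")
```

A union favors recall, since any single measure can connect two documents, while an intersection favors precision, since all measures must agree.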

19. A method as recited in Claim 1, further comprising the step of extracting structural
information from the similarity matrix to obtain new documents supported by the set
of training documents for each category.

20. A method as recited in Claim 19, wherein the structural information is obtained by
optimizing an objective function.

21. A method as recited in Claim 19, wherein the structural information is obtained by
only approximately optimizing an objective function.

22. A method as recited in Claim 21, wherein approximately optimizing the objective
function comprises repeated application of a growth transformation.

23. A method as recited in Claim 19, further comprising the step of creating and storing a
second matrix that represents an interim score for each document in each category.



24. A method as recited in Claim 19, further comprising the steps of, periodically as the
matrix is being computed, normalizing rows of the matrix by normalizing within each
document, across all categories, whereby the score for one document in a particular
category will depend on the scores for that document in all other categories.

25. A method as recited in Claim 19, further comprising the steps of, periodically as the
matrix is being computed, normalizing columns of the matrix by normalizing within
each category, across all documents, whereby the score for one document in a
particular category depends on the scores for all other documents in that category.
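
The row and column normalizations of Claims 24 and 25 can be illustrated with a small sketch. It assumes a score matrix whose rows are documents and whose columns are categories, and it assumes normalization by dividing by the sum of the row or column; the claims do not fix a particular norm.

```python
def normalize_rows(scores):
    # Within each document (row), rescale across all categories so the row sums to 1.
    out = []
    for row in scores:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out

def normalize_columns(scores):
    # Within each category (column), rescale across all documents so the column sums to 1.
    if not scores:
        return []
    totals = [sum(col) for col in zip(*scores)]
    return [[x / t if t else 0.0 for x, t in zip(row, totals)] for row in scores]
```

After row normalization, raising a document's score in one category lowers its normalized score in the others, which is the dependence Claim 24 describes; column normalization produces the analogous dependence among documents within a category.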



26. A method as recited in Claim 1, in which the categories come from a manually
defined taxonomy.

27. A method as recited in Claim 1, wherein the categories are derived from logs of user
queries.

28. A method as recited in Claim 1, further comprising the steps of creating and storing
the matrix using columns representing documents and rows representing user
sessions, and wherein values of elements of the second matrix represent interest in a
document shown by a particular user in a particular session.

29. A method as recited in Claim 1, further comprising the steps of creating and storing
the matrix using columns representing user sessions and rows representing
documents, and wherein values of elements of the second matrix represent interest in
a document shown by a particular user in a particular session.

30. A method as recited in Claim 28, wherein the element values are computed as a
function of a time that a user has spent viewing a document associated with each
element.

31. A method as recited in Claim 28, further comprising the steps of creating and storing a
second matrix representing a similarity between pairs of documents i and j, wherein
the second matrix is derived by comparing pairs of column vectors or row vectors,
respectively i and j of the first matrix.
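
Claims 28, 30, and 31 can be read together as the following sketch. It assumes a hypothetical view log of (session, document, seconds viewed) records, uses a logarithm of viewing time as the interest function, and compares document columns by cosine similarity; each of these choices is an assumption made for illustration.

```python
import math

def interest_matrix(view_log, sessions, documents):
    # Rows are user sessions, columns are documents (the layout of Claim 28).
    s_idx = {s: i for i, s in enumerate(sessions)}
    d_idx = {d: j for j, d in enumerate(documents)}
    m = [[0.0] * len(documents) for _ in sessions]
    for session, doc, seconds in view_log:
        # Interest as a function of viewing time; log1p is an assumed damping choice.
        m[s_idx[session]][d_idx[doc]] += math.log1p(seconds)
    return m

def column_similarity(m):
    # Second matrix: cosine similarity between the column vectors of documents i and j.
    cols = list(zip(*m))
    n = len(cols)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            norm = math.sqrt(sum(a * a for a in cols[i])) * math.sqrt(sum(b * b for b in cols[j]))
            sim[i][j] = dot / norm if norm else 0.0
    return sim
```

Documents viewed for similar lengths of time in the same sessions receive a high score, while documents that never share a session receive zero.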

32. A method as recited in Claim 28, further comprising the steps of creating and storing a
second matrix representing a similarity between pairs of documents i and j, by finding
pairs of documents i and j which have high interest values for a particular user in a
particular session or period of time.

33. The method recited in Claim 1, further comprising the steps of:
identifying a category of a classification taxonomy of the hypertext system in which a
first electronic document is presently classified; and
if a second electronic document is found to be highly similar, storing information that
classifies the second electronic document into the category.

34. A computer-readable medium carrying one or more sequences of instructions, wherein
execution of the one or more sequences of instructions by one or more processors
causes the one or more processors to perform the steps of categorizing a plurality of
new electronic documents into a set of categories, each of the categories containing a
plurality of training set documents, by using a matrix representing document
similarity that is derived by combining two or more measures of document similarity.



35. A method of categorizing a plurality of new electronic documents for use in a
hypertext search system, the method comprising the steps of:
creating and storing a set of categories for the documents;
creating and storing a matrix, in which rows and columns identify documents, and in
which each element of the matrix stores a value that represents a similarity
among a pair of documents associated with a row and column that intersect at
the element;
deriving each matrix value by combining two or more measures of similarity that are
obtained by analys 




36. A method as recited in Claim 35, further comprising the steps of:
for each measure of document similarity, creating and storing a graph of links;
creating and storing a combined graph that combines the graphs and that represents a
generalized similarity of the documents;
computing a generalized similarity value for a pair of documents based on the
combined graph.



37. A method as recited in Claim 36, further comprising the steps of classifying
unclassified documents into category nodes of a taxonomy structure associated with
the hypertext search system based on the generalized similarity value in combination
with a comparison of a set of pre-classified training set of documents with a set of
unclassified documents, to carry out classification.